Goto

Collaborating Authors

 Olomouc Region


Comparing LLM-generated and human-authored news text using formal syntactic theory

Zamaraeva, Olga, Flickinger, Dan, Bond, Francis, Gómez-Rodríguez, Carlos

arXiv.org Artificial Intelligence

This study provides the first comprehensive comparison of New York Times-style text generated by six large language models against real, human-authored NYT writing. The comparison is based on a formal syntactic theory. We use Head-driven Phrase Structure Grammar (HPSG) to analyze the grammatical structure of the texts. We then investigate and illustrate the differences in the distributions of HPSG grammar types, revealing systematic distinctions between human and LLM-generated writing. These findings contribute to a deeper understanding of the syntactic behavior of LLMs as well as humans, within the NYT genre.


Sociotechnical Effects of Machine Translation

Moorkens, Joss, Way, Andy, Lankford, Séamus

arXiv.org Artificial Intelligence

While the previous chapters have shown how machine translation (MT) can be useful, in this chapter we discuss some of the side-effects and risks that are associated, and how they might be mitigated. With the move to neural MT and approaches using Large Language Models (LLMs), there is an associated impact on climate change, as the models built by multinational corporations are massive. They are hugely expensive to train, consume large amounts of electricity, and output huge volumes of kgCO2 to boot. However, smaller models which still perform to a high level of quality can be built with much lower carbon footprints, and tuning pre-trained models saves on the requirement to train from scratch. We also discuss the possible detrimental effects of MT on translators and other users. The topics of copyright and ownership of data are discussed, as well as ethical considerations on data and MT use. Finally, we show how if done properly, using MT in crisis scenarios can save lives, and we provide a method of how this might be done.


Multimodal Deep Learning for Subtype Classification in Breast Cancer Using Histopathological Images and Gene Expression Data

Shandiz, Amin Honarmandi

arXiv.org Artificial Intelligence

Molecular subtyping of breast cancer is crucial for personalized treatment and prognosis. Traditional classification approaches rely on either histopathological images or gene expression profiling, limiting their predictive power. In this study, we propose a deep multimodal learning framework that integrates histopathological images and gene expression data to classify breast cancer into BRCA.Luminal and BRCA.Basal / Her2 subtypes. Our approach employs a ResNet-50 model for image feature extraction and fully connected layers for gene expression processing, with a cross-attention fusion mechanism to enhance modality interaction. We conduct extensive experiments using five-fold cross-validation, demonstrating that our multimodal integration outperforms unimodal approaches in terms of classification accuracy, precision-recall AUC, and F1-score. Our findings highlight the potential of deep learning for robust and interpretable breast cancer subtype classification, paving the way for improved clinical decision-making.


A Differentiable Alignment Framework for Sequence-to-Sequence Modeling via Optimal Transport

Kaloga, Yacouba, Kumar, Shashi, Motlicek, Petr, Kodrasi, Ina

arXiv.org Machine Learning

Accurate sequence-to-sequence (seq2seq) alignment is critical for applications like medical speech analysis and language learning tools relying on automatic speech recognition (ASR). State-of-the-art end-to-end (E2E) ASR systems, such as the Connectionist Temporal Classification (CTC) and transducer-based models, suffer from peaky behavior and alignment inaccuracies. In this paper, we propose a novel differentiable alignment framework based on one-dimensional optimal transport, enabling the model to learn a single alignment and perform ASR in an E2E manner. We introduce a pseudo-metric, called Sequence Optimal Transport Distance (SOTD), over the sequence space and discuss its theoretical properties. Based on the SOTD, we propose Optimal Temporal Transport Classification (OTTC) loss for ASR and contrast its behavior with CTC. Experimental results on the TIMIT, AMI, and LibriSpeech datasets show that our method considerably improves alignment performance, though with a trade-off in ASR performance when compared to CTC. We believe this work opens new avenues for seq2seq alignment research, providing a solid foundation for further exploration and development within the community.


Iconicity in Large Language Models

Marklová, Anna, Milička, Jiří, Ryvkin, Leonid, Bennet, Ľudmila Lacková, Kormaníková, Libuše

arXiv.org Artificial Intelligence

Lexical iconicity, a direct relation between a word's meaning and its form, is an important aspect of every natural language, most commonly manifesting through sound-meaning associations. Since Large language models' (LLMs') access to both meaning and sound of text is only mediated (meaning through textual context, sound through written representation, further complicated by tokenization), we might expect that the encoding of iconicity in LLMs would be either insufficient or significantly different from human processing. This study addresses this hypothesis by having GPT-4 generate highly iconic pseudowords in artificial languages. To verify that these words actually carry iconicity, we had their meanings guessed by Czech and German participants (n=672) and subsequently by LLM-based participants (generated by GPT-4 and Claude 3.5 Sonnet). The results revealed that humans can guess the meanings of pseudowords in the generated iconic language more accurately than words in distant natural languages and that LLM-based participants are even more successful than humans in this task. This core finding is accompanied by several additional analyses concerning the universality of the generated language and the cues that both human and LLM-based participants utilize.


Personalization of Large Language Models: A Survey

Zhang, Zhehao, Rossi, Ryan A., Kveton, Branislav, Shao, Yijia, Yang, Diyi, Zamani, Hamed, Dernoncourt, Franck, Barrow, Joe, Yu, Tong, Kim, Sungchul, Zhang, Ruiyi, Gu, Jiuxiang, Derr, Tyler, Chen, Hongjie, Wu, Junda, Chen, Xiang, Wang, Zichao, Mitra, Subrata, Lipka, Nedim, Ahmed, Nesreen, Wang, Yu

arXiv.org Artificial Intelligence

Personalization of Large Language Models (LLMs) has recently become increasingly important with a wide range of applications. Despite the importance and recent progress, most existing works on personalized LLMs have focused either entirely on (a) personalized text generation or (b) leveraging LLMs for personalization-related downstream applications, such as recommendation systems. In this work, we bridge the gap between these two separate main directions for the first time by introducing a taxonomy for personalized LLM usage and summarizing the key differences and challenges. We provide a formalization of the foundations of personalized LLMs that consolidates and expands notions of personalization of LLMs, defining and discussing novel facets of personalization, usage, and desiderata of personalized LLMs. We then unify the literature across these diverse fields and usage scenarios by proposing systematic taxonomies for the granularity of personalization, personalization techniques, datasets, evaluation methods, and applications of personalized LLMs. Finally, we highlight challenges and important open problems that remain to be addressed. By unifying and surveying recent research using the proposed taxonomies, we aim to provide a clear guide to the existing literature and different facets of personalization in LLMs, empowering both researchers and practitioners.


Analyzing Context Contributions in LLM-based Machine Translation

Zaranis, Emmanouil, Guerreiro, Nuno M., Martins, André F. T.

arXiv.org Artificial Intelligence

Large language models (LLMs) have achieved state-of-the-art performance in machine translation (MT) and demonstrated the ability to leverage in-context learning through few-shot examples. However, the mechanisms by which LLMs use different parts of the input context remain largely unexplored. In this work, we provide a comprehensive analysis of context utilization in MT, studying how LLMs use various context parts, such as few-shot examples and the source text, when generating translations. We highlight several key findings: (1) the source part of few-shot examples appears to contribute more than its corresponding targets, irrespective of translation direction; (2) finetuning LLMs with parallel data alters the contribution patterns of different context parts; and (3) there is a positional bias where earlier few-shot examples have higher contributions to the translated sequence. Finally, we demonstrate that inspecting anomalous context contributions can potentially uncover pathological translations, such as hallucinations. Our findings shed light on the internal workings of LLM-based MT which go beyond those known for standard encoder-decoder MT models.


A Target-Aware Analysis of Data Augmentation for Hate Speech Detection

Casula, Camilla, Tonelli, Sara

arXiv.org Artificial Intelligence

Hate speech is one of the main threats posed by the widespread use of social networks, despite efforts to limit it. Although attention has been devoted to this issue, the lack of datasets and case studies centered around scarcely represented phenomena, such as ableism or ageism, can lead to hate speech detection systems that do not perform well on underrepresented identity groups. Given the unpreceded capabilities of LLMs in producing high-quality data, we investigate the possibility of augmenting existing data with generative language models, reducing target imbalance. We experiment with augmenting 1,000 posts from the Measuring Hate Speech corpus, an English dataset annotated with target identity information, adding around 30,000 synthetic examples using both simple data augmentation methods and different types of generative models, comparing autoregressive and sequence-to-sequence approaches. We find traditional DA methods to often be preferable to generative models, but the combination of the two tends to lead to the best results. Indeed, for some hate categories such as origin, religion, and disability, hate speech classification using augmented data for training improves by more than 10% F1 over the no augmentation baseline. This work contributes to the development of systems for hate speech detection that are not only better performing but also fairer and more inclusive towards targets that have been neglected so far.


ToxiCraft: A Novel Framework for Synthetic Generation of Harmful Information

Hui, Zheng, Guo, Zhaoxiao, Zhao, Hang, Duan, Juanyong, Huang, Congrui

arXiv.org Artificial Intelligence

In different NLP tasks, detecting harmful content is crucial for online environments, especially with the growing influence of social media. However, previous research has two main issues: 1) a lack of data in low-resource settings, and 2) inconsistent definitions and criteria for judging harmful content, requiring classification models to be robust to spurious features and diverse. We propose Toxicraft, a novel framework for synthesizing datasets of harmful information to address these weaknesses. With only a small amount of seed data, our framework can generate a wide variety of synthetic, yet remarkably realistic, examples of toxic information. Experimentation across various datasets showcases a notable enhancement in detection model robustness and adaptability, surpassing or close to the gold labels. We release the generated data at Github upon acceptance.


Optimizing Rare Word Accuracy in Direct Speech Translation with a Retrieval-and-Demonstration Approach

Li, Siqi, Liu, Danni, Niehues, Jan

arXiv.org Artificial Intelligence

Direct speech translation (ST) models often struggle with rare words. Incorrect translation of these words can have severe consequences, impacting translation quality and user trust. While rare word translation is inherently challenging for neural models due to sparse learning signals, real-world scenarios often allow access to translations of past recordings on similar topics. To leverage these valuable resources, we propose a retrieval-and-demonstration approach to enhance rare word translation accuracy in direct ST models. First, we adapt existing ST models to incorporate retrieved examples for rare word translation, which allows the model to benefit from prepended examples, similar to in-context learning. We then develop a cross-modal (speech-to-speech, speech-to-text, text-to-text) retriever to locate suitable examples. We demonstrate that standard ST models can be effectively adapted to leverage examples for rare word translation, improving rare word translation accuracy over the baseline by 17.6% with gold examples and 8.5% with retrieved examples. Moreover, our speech-to-speech retrieval approach outperforms other modalities and exhibits higher robustness to unseen speakers. Our code is publicly available (https://github.com/SiqiLii/Retrieve-and-Demonstration-ST).